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Abstract 

Background: Bayesian Network (BN) is a powerful approach to reconstructing genetic regulatory networks from 
gene expression data. However, expression data by itself suffers from high noise and lack of power. Incorporating 
prior biological knowledge can improve the performance. As each type of prior knowledge on its own may be 
incomplete or limited by quality issues, integrating multiple sources of prior knowledge to utilize their consensus is 
desirable. 

Results: We introduce a new method to incorporate the quantitative information from multiple sources of prior 
knowledge. It first uses the Naive Bayesian classifier to assess the likelihood of functional linkage between gene 
pairs based on prior knowledge. In this study we included cocitation in PubMed and schematic similarity in Gene 
Ontology annotation. A candidate network edge reservoir is then created in which the copy number of each edge 
is proportional to the estimated likelihood of linkage between the two corresponding genes. In network simulation 
the Markov Chain Monte Carlo sampling algorithm is adopted, and samples from this reservoir at each iteration to 
generate new candidate networks. We evaluated the new algorithm using both simulated and real gene 
expression data including that from a yeast cell cycle and a mouse pancreas development/growth study. 
Incorporating prior knowledge led to a ~2 fold increase in the number of known transcription regulations 
recovered, without significant change in false positive rate. In contrast, without the prior knowledge BN modeling 
is not always better than a random selection, demonstrating the necessity in network modeling to supplement the 
gene expression data with additional information. 

Conclusion: our new development provides a statistical means to utilize the quantitative information in prior 
biological knowledge in the BN modeling of gene expression data, which significantly improves the performance. 



Background 

Reverse engineering of genetic networks will greatly facil- 
itate the dissection of cellular functions at the molecular 
level [1-3]. The time course gene expression study offers 
an ideal data source for transcription regulatory network 
modeling. However, in a typical microarray experiment 
usually up to tens of thousands of genes are measured in 
only several dozens or less samples, data from such 
experiments alone is significantly underpowered, leading 
to high rate of false positive predictions [4]. Network 
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reconstruction from microarray data is further limited by 
low data quality, noise and measurement errors [5]. 

Incorporating other types of data and existing knowl- 
edge of gene relationships into the network modeling pro- 
cess is a practical approach to overcome some of these 
problems. It has been proven that data integration and 
useful bias with relevant knowledge can improve the net- 
work prediction accuracy from gene expression data [6,7]. 
Among the various approaches of network modeling, 
Bayesian Networks (BN) have shown great promise and 
are receiving increasing attention [8]. BN is a graphic 
probabilistic model that describes multiple interacting 
quantities by a directed acyclic graph (DAG). The nodes 
in the network represent random variables (expression 
levels), and edges represent conditional dependencies 
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between nodes [9]. Learning a BN structure is to find a 
DAG that best matches the dataset, namely maximizing 
the posterior probability of DAG given data D: P (DAG| 
D). The sound probabilistic schematics allow BN to deal 
with the inherent stochasticity in gene expressions and the 
noise brought in by the microarray technology. Further- 
more, BN is capable of integrating prior knowledge into 
the system in a natural way [9,10]. 

A number of studies demonstrated that adding prior 
knowledge to BN improved the performance [4,11-14]. 
Many sources of data and information are useful to sup- 
plement the gene expression data, and they can be incor- 
porated at different steps of BN simulation, from prior 
structure definition to structure simulation and evaluation. 

Known protein-DNA interaction or other clues of the 
relationships between transcription factors and their target 
genes are useful to transcription regulatory network infer- 
ence. Hartemink et al. included data from the chromatin 
immunoprecipitation (ChIP) assay [15], and Tamada et al 
incorporated promoter sequence motif information [16], 
to define the prior probability of network structures. Infor- 
mation of other types of gene pair relationship has also 
been explored. Steele et al developed a gene-pair associa- 
tion score from the correlation of their concept profiles 
derived from literature, and utilized that to define the 
prior structure probabilities [12]. Larsen et al defined a 
Likelihood of Interaction (LOI) score, which measures the 
statistical significance of two genes interacting with each 
other according to their shared Gene Ontology (GO) 
information. They then restricted the candidate network 
edges (interactions) to those with significant p -values of 
LOI during the BN structure learning iterations [17,18]. 
By doing so, the quantitative information of the likelihood 
is not fully utilized in the network modeling. Djebbari and 
Quackenbush utilized literature, high-throughput protein- 
protein interaction (PPI) data, or the combination of both 
to define the seed (initial) network structure. They 
observed an improved ability of the BN analysis to learn 
gene interaction networks from the expression data [19]. 

Imoto et al formulated an novel approach to incorporate 
prior biological knowledge within the BN framework by 
adopting the energy concepts from statistical physics 
[20,21], which was later further extended by Husmeier and 
Werhli [22,23]. In this approach an energy function was 
first defined to measure the agreement between a candi- 
date network and the prior biological knowledge, and 
prior distribution of network structure is hence calculated 
using the Gibbs distribution in a canonical ensemble. 
Using this approach, the two groups examined several 
types of prior knowledge, including PPI, protein-DNA 
interaction, binding site information, literature, and Kyoto 
Encyclopedia of Genes and Genomes (KEGG) pathways 
[22-24]. The algorithms were validated using yeast gene 
expression data [20,21], and synthetic data [22]. 



Existing studies often utilize prior knowledge to con- 
struct the prior distribution of network, or initial network 
structure. It has been demonstrated that the sampling 
method during simulation also affects the performance of 
BN structure learning [25]. Though prior knowledge has 
been utilized to bias the sampling step, it is normally done 
through restricting the search space to sub regions, for 
instance, only simulate candidate structures whose signifi- 
cance is above a certain threshold according to prior 
knowledge [17,18]. 

In searching for the network structure (DAG) that maxi- 
mize P (DAG|D), the Markov Chain Monte Carlo 
(MCMC) approach is regarded better than greedy search- 
ing algorithms, especially for the microarray data with 
small sample size where there is often no single structure 
that is prominently better than others [9]. In this study we 
propose a new approach to incorporate prior knowledge 
in a quantitative way to bias the MCMC simulation of 
candidate structure. It utilizes information of functional 
linkage between gene pairs, assuming that functionally 
linked genes are likely to interact with each other. It is 
known that interacting proteins or genes often share simi- 
lar function, and participate in the same biological path- 
ways and processes [26]. Interaction has been utilized to 
infer functional linkage and annotate gene functions [27] . 
Increasing evidence suggests that the reverse is also fre- 
quently true [28]. In our algorithm a probability score is 
first calculated that measures how likely two genes are 
functionally linked based on prior knowledge; A candidate 
edge reservoir is then constructed where the number of 
copies of each edge is proportional to this probability 
score; The reservoir is in turn used for sampling candidate 
network structure during the MCMC simulation. This 
way the quantitative information of the potential gene pair 
link predicted by prior knowledge is retained. 

We will consider two type of prior knowledge: co-citation 
in PubMed literature and similarity in ontological annota- 
tion according GO http://www.geneontology.org/. We will 
demonstrate they both contain information of functional 
linkage. The performance of the new algorithm is evaluated 
using a synthetic data set as well as data from two real 
microarray experiments: the yeast cell cycle study, and the 
mouse pancreas development/growth study. We will 
demonstrate that including the prior knowledge signifi- 
cantly improves the performance of BN modeling of gene 
expression data. 

Results 

Algorithm 

BN is a graphical model to capture complex relationships 
among a set of random variables {X±, X 2 ,...,X n } encoding 
the Markov assumption, each node representing a vari- 
able. In the context of gene network modeling, each node 
represents a gene, while gene interactions are represented 
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by directed edges between nodes. Each variable X t in the 
DAG is conditionally independent of its non-descendants 
given its set of parents. Mathematically the joint distribu- 
tion of the DAG can be decomposed into a product 
form as: 

n 

p(dag) = p(Xi,x 2/ . . . ,x„) = Ppq/no (i) 

i=i 

where 11/ denotes the parent set of the variable X t . 
This is referred as the chain rule for BNs [9,10]. Learn- 
ing a BN structure is to find a DAG that best matches 
the dataset, namely maximizing the posterior probability 
of DAG given data P (DAG|D). Here we adopt the sam- 
pling-based approach to Bayesian inference, and sample 
network structures from a candidate edge reservoir with 
the MCMC network learning method. In the reservoir 
the edge representation is proportional to the likelihood 
of the two genes being functionally linked according to 
prior knowledge. This way, the edges between the 
strongly-related gene pairs have higher chance to be 
proposed as part of the candidate network. The overall 
design is given in Figure 1. The major steps included: 

1. Determine the probability of functional link p iink 
between each gene pairs 

1.1 Calculate GO schematic similarity 

1.2 Calculate p value of PubMed co-citation. 

1.3 Integrate GO and PubMed information using 
the Naive BN to determine p link . 

2. Construct candidate network edge reservoir in 
which copy number of each edge is proportional to 
the ^ii n k of the corresponding gene pair. 



3. Learn network structure using the MCMC algo- 
rithm through sampling the candidate network edge 
reservoir. 

At each step of the iteration, the proposed network is 
retained with an acceptance probability that is deter- 
mined by the relative posterior of the proposed versus 
current network, penalized by the network complexity 
[29,30]. In calculating the posterior we use the BDe 
(Bayesian Dirichlet equivalence) scoring metric [10,31]. 
The prior distribution is assumed to be uniform. 

To evaluate the performance of our BN algorithm, and 
the benefit of adding prior knowledge, we compare it to 
two alternative approaches: (1) Plain BN. In each itera- 
tion, a new network is proposed by randomly changing 
one edge in the current network. (2) The method devel- 
oped by Husmeier and Werhli [22,23]. 

GO schematic similarity and significance of PubMed co- 
citation 

GO annotation and gene citation database (PubMed) 
were downloaded from ftp://ftp.ncbi.nlm.nih.gov/gene/ 
DATA. Schematic similarity in GO taxonomy was first 
calculated for each gene pair using the approach pro- 
posed by Cao et al [32], which calculates the shared 
information content of the GO terms. The value of this 
measure ranges between [0 1], with 0 being no similarity, 
and 1 being maximum similarity. The GO similarity 
between each gene pair is defined to be the maximum 
schematic similarity of all the GO terms they share. 

For a given pair of genes, the total number of PubMed 
abstracts in which each gene appears (n and m, respec- 
tively), and in which both appear (/<) were determined. 
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Figure 1 The framework of our BN modeling to incorporate the quantitative information in prior knowledge. 
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N(Edgey) = Ceil(10 x p link (z,j)) (3) 

where Cei\(x) is the smallest integer no less than x. 
In this definition, any gene pair will be represented 
at least once and at most 10 times. The edges of 
gene pairs with higher pu nk will appear more fre- 
quently in the edge reservoir, and hence enjoy a 
higher chance to be selected during the network 
structure learning. 

Implementation 

Our BN simulation algorithm is implemented in Matlab 
utilizing Kevin Murphy's BNT package [33]bnt.google- 
code.com, and is summarized in Table 1. Note that 
steps 1 and 3.1 contain unique features that separate 
our approach from others. The source code is available 
upon request. The networks were visualized with Cytos- 
cape [34]. 



Table 1 Implemention of the new BN structure learning algorithm 

Input: 

n: number of nodes in the network. 
D: discretized expression data matrix. 

Burnln: number of steps to take before drawing sample networks for evaluation. Default value: 50 times the size of the sampling reservoir, 
njteration: number of iterations. Default value: 80 times the size of the sampling reservoir. 
A_samples: interval of sample networks being collected from the chain after burn-in. Default 
value: 1000. 

maxFanln: maximum number of parents of a node. 
Output: 

A set of DAGs after reaching the max iteration step. 
An average DAG in the form of a matrix. 

Steps 

1. Create a sampling edge reservoir based on p /ink . 

2. Set all elements of the adjacency matrix for the initial DAG to 0. 

3. for loop_index = 1: njteration do 

(1) randomly select a element edge(ij) from the edge sampling reservoir, corresponding to gene pair (i,j). 

(2) if edge(ij) exists in the current DAG, delete the edge; else if edge(j,i) exists in the current DAG, reverse edge(j,i) to edge(j,i); else add 
edge(ij). We name these operations as "delete", "reverse" and "add", respectively. 

(3) check whether the newly proposed DAG remains acyclic and satisfy the maxFanln rules to nodes (i,j). If not, keep the current DAG and 
give up proposed DAG, go to (1). 

(4) calculate log value of the marginal likelihood (LL)* of the expression data D of node j and its parents given the current DAG (LL_old) or 
the proposed DAG (LL_new) and define bf1 = exp(LL_new - LL_old). 

(5) if the operation is "delete" or "add", bf2 = 1; if the operation is "reverse", calculate bf2 for node i in same way as for node j in (4). 

(6) calculate the prior probability* of current DAG (prior_old) and propose DAG (prior_new); calculate the Metropolis-Hastings ratio (/? HM ) of 
the two DAGs; generate a random number u between 0 to 1, if bf1*bf2*prior_new/prior_old<u*/? H M/ keep the current DAG and give up 
proposed DAG, go to (1). 

(7) when loop_index>Burnln and (loopj'ndex-Bumln) is exactly divisible by A_samples, record the proposed DAG and its posterior 
probability. 

4. End of loop, calculate the average DAG in the form of a matrix, where the elements are given by the averaged edges of all recorded DAGs 
weighted by their posterior probabilities. 

^Details of the definition of marginal likelihood, and how to calculate LL, prior probability of DAG, can be found in [10,31]. 



The probability of co-citation frequency observed by 
random chance is calculated by 

fe-i 

pPubMed(# of co - citation > k\n, m, N) = 1 — p{i\n, m, N) (2) 

i=0 

n\(N — ri)\m\(N — m)\ 

where pit \n,m,N) = r — ; — , 

FU ; (n - i)\i\(m - i)!(N - n - m + i)!N!' 

and N is the total number of abstracts in PubMed [1,2]. 

Construction of the candidate network edge reservoir 

We used the Naive Bayesian network to integrate the 
GO and co-citation information, and a simple Bayesian 
naive classifier to predict the functional linkage prob- 
ability pu nk for all gene pairs. Note that the prior knowl- 
edge of functional linkage is undirected, i.e. pu n k (i, /) = 
Punk (j> 0* An edge sampling reservoir was constructed, 
in which the number of replicates for the edge between 
gene i and ; N (Edge^-) is in proportion to their p hnk : 
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Validation 

Utility of GO similarity and PubMed co-citation in 
discovering functional linkage between gene pairs 

Lee et al developed an approach to evaluate if gene-pair 
functional relationships can be predicted by a certain 
type of high-throughput genomic data (gene expression, 
PPI, ChlP-chip, etc) [35,36]. Assuming that p(L\D) and 
(~ L\D) denote the probabilities of gene pairs to share or 
not share functional annotation given that they are linked 
by data D (for instance, co-expressed, sharing PPI, pro- 
tein of one gene binds to the promoter of the other, etc), 
and p(L) and p(~ L) represent the prior probabilities of 
sharing and not sharing functional annotation, they pro- 
posed a log likelihood score [35,36]: 



V P(L)/P(~L) ) 



(4) 



to describe the utility of data D in functional linkage 
inference. An LLS close to 0 suggest that the data is not 
more informative than random pairing, whilst higher 
positive values of LLS indicates that data D contains 
more information of functional linkage. 

We adopted equation (4) to evaluate whether GO sche- 
matic similarity and PubMed co-citation were useful in 
identifying functional linkage. The KEGG http://www.gen- 
ome.ad.jp/KEGG and Munich Information Center for Pro- 
tein Sequences (MIPS, mips.gsf.de/) database [1,2] were 
used to construct the benchmarks of functional linkage. 
These databases were chosen for their high quality [37]. In 
this study we utilized yeast and mouse gene expression 
data to validate our algorithm. For each species, the posi- 
tive control set consists of randomly sampled 5% (43,761 
for yeast, and 35,424 for mouse) of all gene pairs that are 
in the same KEGG pathways [38]. The choice of 5% rather 
than all is to lower the computational complexity. The 
negative control set was constructed with gene pairs that 
encode proteins localized in different cellular compart- 
ments, with the underlying assumption that they are func- 
tionally unrelated and do not interact with each other. 
Four categories in the MIPS annotation [39] were utilized: 
70.03 cytoplasm, 70.10 nucleus, 70.16 mitochondrion, and 
70.27 extracellular/secretion proteins. 

Again we only kept 5% of all possible gene pairs, total- 
ing 112,693 for yeast and 531,089 for mouse, respectively. 



The same benchmark sets were also utilized to train the 
Naive Bayesian classifier when calculating /7 link . 

The LLS of co-citation in discovering functional link- 
age is then determined by: 



LLSp u bMed = In 



P(£|ppubMed)/P(~ VppubMed) 

P{L)/P(~L) 



(5) 



The LLS of GO schematic similarity was performed in 
the similar fashion. The LLS value for gene pair sets in 
different ranges of GO similarity and co-citation p-value 
were given in Table 2. Gene pairs sets with higher GO 
similarity or PubMed co-citation significance, have more 
positive LLS values, and vice versa. Note that gene pairs 
with negative LLS means they are less likely to be func- 
tionally linked than random pairs, which is expected if 
they share low GO similarity or co-citation. The results 
suggest that PubMed Co-citation and GO similarity are 
efficient at discriminating functionally linked gene pairs 
from not linked ones. 

We found that there is a marginal dependence 
between the GO similarity and PubMed co-citation 
(Fisher's Z test, p~0.1). Theoretically naive Bayesian 
classifier is optimal when the attributes are independent 
given class. However, empirical studies have shown that 
the classifier still performs well in many domains when 
there is moderate attribute dependences [40]. The weak 
dependence between them indicates that the naive Baye- 
sian Network is an appropriate choice to integrate their 
information [41]. Interestingly, the GO and MIPS cate- 
gories, which are both functional annotations, also only 
depend weakly on each other. This may be because the 
present annotations are far from being perfect and com- 
plete [42]. 

Utility of functional linkage information to interaction 
network modeling 

The distribution of p iink for yeast gene pairs is given in 
Figure 2. Note that only a small proportion of gene 
pairs share high values of pu n ] 0 for about 92% of the 
gene pairs this value is less than 0.2. This indicates that 
most gene pairs share no functional linkage, consistent 
with the fact that gene networks are usually sparse. The 
candidate edge reservoir is constructed according to 
equation 3, and the MCMC samples this distribution to 



Table 2 GO and PubMed citation contain information of functional linkage 


interval 


GO similarity LLS, yeast 


LLS, mouse 


interval 


-log 10 (p P ubMed)LLS / yeast 


LLS, mouse 


[1, 1] 


1.51 


1.62 


(4 oo) 


0.25 


0.37 


[0.2, 1) 


-0.71 


-0.99 


(3 4] 


0.13 


0.14 


[0, 0.2) 


-1.61 


-2.2 


(1 3] 


0.07 


0.19 








[0 1] 


-3.4 


-3.6 



Log Likelihood Scores of functional linkage in yeast and mouse, for gene pair in different value interval of GO similarity and PubMed co-citation significance. 
Gene pairs with higher GO similarity or significance of co-citation are more likely to be functionally linked. 



Gao and Wang BMC Bioinformatics 201 1, 12:359 
http://www.biomedcentral.eom/1 471 -21 05/1 2/359 



Page 6 of 1 3 



o 



100 
90 
80 
70 
60 
50 
40 
30 
20 
10 
0, 



-all gene pairs 

■■interactions predicted with prior knowledge 
interactions predicted without prior knowledge 



0.2 



0.4 



0.6 



0.8 



Figure 2 Distribution of functional linkage probability for all 
possible gene pairs, and for predicted interactions with and 
without prior knowledge. 
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Figure 3 ROC curve indicating that functional linkage contains 


information for interaction. Plotted is the performance of p jink as 


a classifier to identify yeast TF -target pairs defined by ChlP-chip. 



propose new candidate network structure at each itera- 
tion. In Figure 2 we have also included the distribution 
for gene pairs predicted to be interacting to each other, 
with and without the prior knowledge. Among all possi- 
ble gene pairs, only -8% with pi ink 0.6. In contrast, this 
proportion increases to 28% among the predicted inter- 
actions. It indicates that the prior knowledge did affect 
the outcome of the BN learning. The results from the 
other data sets are similar. 

The assumption of incorporating prior knowledge of 
functional linkage is that they can help network model- 
ing. Existing data from yeast revealed that genes sharing 
the same GO attribute interact genetically more often 
than expected by chance (p < 0.05) [43,44] . In a very con- 
servative estimate, over -12% of the genetic interactions 
are comprised of genes with identical GO annotation 
(a 12 fold enhancement over what expected by chance, 
p < 10" 12 ); and over 27% are between genes with similar 
or identical GO annotations (an 8 fold enhancement, 
p < 10 10 ). 

We examined whether p Unk can potentially discriminate 
interacting gene pairs from non-interacting ones, using 
the receiver operating characteristic (ROC) curve. ROC is 
a graphical plot of the sensitivity versus (1 -specificity), 
namely the fraction of true positives versus the fraction 
of false positives, as the discrimination threshold of a 
classifier is varied. The area under curve (AUC) reflects 
the performance. The ROC of a random classifier would 
be a 45° line with AUC = 0.5. Figure 3 presents the ROC 
plot for the nine yeast cell cycle regulating transcription 
factors (TF): Fkhl, Fkh2, Nddl, Mcml, Ace2, Swi5, 
Mbpl, Swi4, and Swi6, and their targets identified using 
the ChlP-chip technology [45]. The AUC of 0.6064 



indicating that is positively correlated with interaction 
and therefore useful in interaction inference. 

Convergence of simulation 

In Figure 4A we plot the acceptance ratio versus number 
of MCMC steps in the yeast cell cycle dataset. Obviously 
in the later steps the probability to accept the new pro- 
posed DAG is small and flattens. The results from the 
other datasets are similar. In addition, the MCMC simu- 
lation was repeated 20 times with independent initializa- 
tions, and consistency in the marginal posterior 
probabilities was examined. We found that they corre- 
lated well between different runs: 0.83 ± 0.11 for the 
simulated dataset, 0.68 ± 0.10 for the yeast data set, and 
0.51 ± 0.26 for the mouse pancreas dataset. Figure 4B 
presents the scatter plot of the edge posterior probability 
from two typical runs that simulate the yeast dataset. 

Validation using simulated data 

In our network inference, the MCMC learning simulation 
is repeated 20 times with independent initializations and 
an interaction will be considered in the final network if it 
is observed more than 15 times. Our new BN algorithm 
was first tested in a simulated time course (50 time 
points) gene expression dataset of an artificial network 
generated using SynTReN [46]. This network contains 76 
genes, of which 24 act as regulators with a total of 124 
regulatory relationships (i.e. 124 edges). The results are 
summarized in Table 3, 2n d column. It demonstrates that 
incorporating the functional linkage as prior knowledge 
allows the identification of a significantly higher number, 
21 versus 14, of the true gene-gene relationships com- 
pared with the plain BN modeling of gene expression 
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Figure 4 Convergence of simulation. (A) Acceptance ratio versus the number of MCMC steps (B) scatter plot of the marginal posterior 
probabilities of the edges, obtained from two separate MCMC simulations of the yeast cell cycle data. 



data only. A random network of the same number of 
edges was also created for the 76 genes [47]. The 
improvement of BN with prior knowledge over random 
is significant (p < 0.01, Table 3), while without prior 
knowledge it is not (p~0.2, Table 3). 

Validation using the yeast cell cycle data 

Next the new algorithm was applied to one of the Stan- 
ford yeast cell cycle data http://genome-www.stanford. 
edu/cellcycle/, where the cells from a cdcl5 temperature 
sensitive mutant were studied [48]. To evaluate the per- 
formance, we compared the predicted interactions from 
our algorithm to the annotated interactions in BIND 
http://bind.ca[49], and the transcription regulation pre- 
dicted by the ChlP-chip data [45]. Tables 4, 5 list the 



benchmark interactions for the 107 yeast cell cycle 
genes that were recovered by the BN modeling. The sta- 
tistical results are summarized in Table 3, columns 3-4. 

Evidently, our method is capable of identifying a higher 
number of the positive benchmarks compared with the 
plain BN without prior knowledge. When evaluated with 
the BIND annotation, the number of correctly identified 
interactions doubled from 13 to 26 (p~0.13, x 2 ~2.28). The 
plain BN actually did not perform better than random 
selection (p~0.11). In contrast, BN with prior knowledge 
performed significantly better than random selection with 
X 2 = 24.5, p < 0.001. When evaluated with the ChlP-chip 
data, the story is similar. The number of correctly identi- 
fied gene regulatory relationships increased from 11 to 23 
with the addition of prior knowledge (p < 0.01, x 2 = 6.71). 



Table 3 The improvement in network modeling with the addition of prior knowledge 



Data set 


Simulated 
data 


Yeast cell cycle study, 
benchmark from BIND 


Yeast cell cycle study, 
benchmark from ChlP-chip 


Mouse 

pancreas 

study 


Number of genes 


76 


107 


107 


36 


Number of established regulations 


124 


114 


190 


24 


Number of possible regulations 


76*75 = 5700 


107*106/2 = 5671* 


9*106 = 954 


36*35 = 1260 


Number of known regulations recovered with 
(without) prior knowledge 


21 (14) 


26 (13) 


23 (11) 


12 (6) 


Total number of regulations predicted, with 
(without) prior knowledge 


503 (440) 


436 (387) 


58 (33) 


322 (297) 


Improvement over plain BN 


1 2 = 0.36, 
p-0.54 


% 2 = 2.28, p < 0.13 


X 2 = 0.04, p~0.84 


1 2 = 0.98, 
p-0.32 


Improvement: over random selection 


1 2 = 7.32, 
p < 0.01 


X 2 = 24.5, p < 0.001 


X 2 = 6.71, p < 0.01 


X 2 = 2.87, 
p < 0.09 


Plain BN over random 
selection 


X 2 = 1 -58, 
p~0.2 


X 2 = 2.42, p~0.11 


X 2 = 1.6, p~0.2 


x 2 = o.oi, 

p~0.8 



* We ignored edge direction with comparing to BIND since it contains both directed and undirected interactions. 
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Table 4 Predicted yeast gene regulatory relationships 
that are annotated in BIND 





BN with prior knowledge 




HTA1->HHT1 


FUS1->FAR1 


FKH2->CLB2 


GAS1->SWI4 


SWI5->FKH1 


DPB3->CDC45 


DPB2->DPB3 


CLN2->CLN3 


ASF1->HHF1 


GAS1->KRE6 


CLN3->CLB6 


CDC14->SIC1 


SWI4->MBP1 


MSH6->POL30 


CLB6->CLN1 


SWI4^CHS3 


KAR3->NUM1 


HHF1->HHT1 


M0B1->DBF2 


RFA1->RFA3 


CLB1->CLB3 


CLN1->CLN3 


CDC45^CDC6 


CLB1->CLB5 


HHF1->HTB2 


HPR5->RAD54 








BN Without prior knowledge 




HTA1->HHT1 


FUS1->FAR1 


FKH2->CLB2 


GAS1->SWI4 


SWI5->FKH1 


DPB3->CDC45 


DPB2->DPB3 


CLN3->CLN2 


DBF4->CDC5 


CDC8->CIK1 


CDC6->CDC45 


CLB3-»CDC6 


SIC1->CDC14 









Relationships in bold font are predicted both with and without prior 
knowledge. 



Without the prior knowledge, the plain BN is not different 
from random selection (p~0.1). 

Figure 5A-5C shows the ROC curves that give a more 
quantitative view of the performance of BN with/with- 
out prior knowledge, and of the Werhli and Husmeiers 
algorithm [22,23], in detecting TF-target gene interac- 
tions. Incorporation of prior knowledge significantly 
improved the performance with higher AUC. Our algo- 
rithm performed slightly better than Werhli and 
Husmeiers. 

Validation using mouse pancreas development data 

We also validated our algorithm using a mammal data- 
set. The experiment profiled gene expression changes in 
pancreas during embryonic development or during com- 
pensatory growth after partial pancreatectomy. Elucidat- 
ing the networks is key to understand the complex 
nature of pancreas development and function [50,51]. A 



Table 5 Predicted yeast gene regulatory relationships 
that are confirmed by ChlP-chip 





BN with prior knowledge 




FKH2->HHF1 


FKH2->CLB2 


SWI6-»CLN1 


FKH2->HHT1 


SWI5->FKH1 


SWI6->HO 


SWI6->POL30 


SWI4->MFA2 


FKH1->SWE1 


FKH2->CDC6 


FKH2->SWI4 


SWI4->PSA1 


SWI6->HHT1 


SWI5->ASH1 


SWI6->CLN2 


FKH2->SWE1 


FKH2->HPR5 


SWI6->RAD54 


FKH1->RAD51 


SWI6->HHF1 


SWI6->AGA1 


SWI4->AGA1 


SWI4->MBP1 






BN without 


prior knowledge 




SWI6->POL30 


SWI6->CLN1 


FKH2— »HHT1 


SWI5->FKH1 


FKH2->HHF1 


SWI4->MFA2 


SWI6->HO 


FKH2->CLB2 


SWI4->TIR1 


FKH1->CDC6 


FKH1->CDC20 





Relationships in bold font are predicted both with and without prior 
knowledge. 



number of efforts have been made to manually annotate 
the key transcription factors and the gene networks they 
regulate based on low-throughput data, nicely reviewed 
by Servitja and Ferrer [52]. In Table 6, we list the 24 
experimentally confirmed gene-gene regulatory relation- 
ships [52], and their network is depicted in Figure 6A. 
With prior knowledge BN modeling of the expression 
data is able to recover half of them (12), as shown in 
Figure 6C and Table 6. In contrast, the plain BN is only 
able to identify 6 of them (Figure 6B). This is again a 
-two-fold enhancement. In Figures 7A-7C the ROC 
curves are presented. Incorporation of prior knowledge 
significantly improved the ability to detect known inter- 
actions. Our algorithm performed comparably to Werhli 
and Husmeiers. 

In Additional file 1, we listed the GO similarity and 
PubMed co-citation of the gene pairs with known regu- 
latory relationships that were missed by plain BN. 
Clearly, almost all of them have high GO similarity and 
share a significant number of co-citations. Adding the 
functional linkage as prior knowledge helped to recover 
them. 

Discussion 

In this study we proposed a new algorithm to quantita- 
tively utilize prior biological knowledge in the network 
modeling of gene expression data. First the functional 
linkage of gene pairs was assessed based on multiple data 
sources using the naive Bayesian classifier. The result was 
then utilized to construct a candidate network edge 
reservoir, where the number of replicate edges between 
each gene pair was proportional to their function linkage 
probability. During simulation new candidate network 
structure was formed by sampling from this reservoir at 
each iteration. Since the edges of gene pairs with stronger 
functional linkage had more representations in the reser- 
voir, these biologically meaningful edges enjoyed a 
preferential treatment in network simulation. With both 
the simulated and real gene expression data, we demon- 
strated that incorporating the prior knowledge signifi- 
cantly improved the network modeling performance. 
More information of the gene interaction network could 
be extracted from the microarray data with higher accu- 
racy. In contrast, in all datasets, without the prior knowl- 
edge, though the number of benchmark regulations 
recovered is more than a random selection, the improve- 
ment is not statistically significant, demonstrating the 
necessity to supplement the gene expression data with 
additional information. This finding that plain BN did 
not perform better than random selection was not unex- 
pected, similar observations was recently reported for a 
number of publically available reverse-engineering algo- 
rithms when gene expression data is the sole source of 
information [47]. 
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false positive rate false positive rate false positive rate 

Figure 5 ROC curves for the network modeling of the yeast cell cycle data using plain BN (A), Werhli and Husmeier's (B), and our 
algorithm (C). ChlP-chip binding data were used as benchmark. Adding prior knowledge significantly improved BN performance at identifying 
the TF -target pairs. 



Our algorithm provides a practical way to integrate the 
probabilistic biological knowledge that is different from 
previous efforts by others [2]. The quantitative nature 
makes it capable to handle soft constraints. Using the 
approach by Werhli and Husmeier for instance [22,23], 
we differ in several key steps. First, they encode multiple 
sources of prior knowledge in a weighted sum via an 
energy function; we integrate information from multiple 
sources through a Bayesian classifier. Furthermore, in 



our approach the MCMC samples from a candidate edge 
distribution defined by the prior knowledge, rather than 
from the network posterior distribution where the net- 
work prior is defined by the prior knowledge. Our algo- 
rithm utilizes the prior knowledge at interaction level, 
while theirs at the network level. Finally the Werhli and 
Husmeier approach is more computational intensive. To 
reduce the computational complexity, they sum over all 
parent configurations of each node and limit the number 



Table 6 Established pancreas gene regulatory relationships that are identified by BN modeling 

Known regulatory relationship Identified by BN modeling with prior knowledge Identified by the plain BN without prior knowledge 



Hes1^Neurog3 


V 




Hnf4a->Tcf1 


V 


V 


Pdx1->Gck 


V 




Pdx1->Hnf4a 






Pdxl^-lapp 






Pdx1->lns2 


V 




Pdx1->Nr5a2 


V 


V 


Mafb^lns2 






Mafb->Pdx1 






Neurog3->Nkx2-2 


V 




Nkx2-2->Gck 


V 


V 


Nkx2-2->lapp 






Nkx2-2->lns2 






0necut1^Pdx1 






Onecut1^Neurog3 






0necut1^Tcf1 


V 




Pax6— >Gck 






Pax6^lapp 


V 


V 


Pax6^lns2 


V 


V 


Pax6->Pdx1 


V 


V 


Tcf1->Hnf4a 






Tcf1->Pdx1 






Tcf1->Pklr 






Tcf1->Slc2a2 


V 





BN with prior knowledge can recover half of the experimentally confirmed transcriptional regulations during mouse pancreas development, two times more than 
the plain BN without prior knowledge. 
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of parents of each node to 3 or less; the complexity of 
this operation is ( N ~ l ) (where N is size of the network, 
and m the maximum Fanln) [23]. We find that it is still 
memory consuming for networks of moderate or large 
sizes. For instance, a Dell Optiplex 755 with 2GHZ DUO 
CPU, 3.25 GB RAM ran out of memory when simulating 
the 107-gene yeast network. Our algorithm does not have 
this problem. 

We used two sources of prior evidence of functional 
linkage to assist network modeling: the PubMed co-cita- 
tion and GO schematic similarity. However, our frame- 
work by design allows the integration of other types of 
data or knowledge, for instance, high throughput geno- 
mic data including PPI and ChlP-chip; gene-gene rela- 
tionships derived from advanced methods including text 
mining [53], database curation, and computational mod- 
eling of sequence information; and many other sources. 
It has been demonstrated that the degree of improve- 
ment brought in by prior knowledge highly depends on 
the quality of the information being added [54]. Low 
quality prior knowledge could even lower the perfor- 
mance of BN [54]. Presently, most of the available prior 
knowledge each on its own suffers from high false 



positive rate and being incomplete, which can limit their 
efficacy in network modeling. Integration of data from 
different sources and utilizing their consensus provides 
an effective means to deal with this issue [1,2]. A caveat 
here is, when considering more sources of data, the 
inter-dependency among them need to be scrutinized 
more carefully, and maybe a more sophisticated integra- 
tion method than the naive Bayesian classifier is needed. 

A number of different approaches have been devel- 
oped to integrate multiple sources of prior information 
in the BN modeling of gene expression data, at the dif- 
ferent steps of the simulation process [4,11-14]. It would 
be of interest to compare the efficiency of the different 
approaches, investigate whether the optimal approach 
depends on the types of prior knowledge, and if the dif- 
ferent approaches can be combined for a most efficient 
utilization of prior knowledge in network modeling. 

Conclusion 

In this paper we proposed a new algorithm to integrate 
and utilize the prior biological knowledge in the BN 
modeling of gene expression data. Our study demon- 
strated that incorporating prior knowledge at the step of 




Figure 7 ROC curves for the pancreas development data with plain BN (A), Werhli and Husmeier's algorithm (B), and our approach 

(C). 24 experimentally confirmed interactions were used as benchmark. 
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network structure simulation is an efficient way to pre- 
serve the quantitative information in it, and to improve 
the performance of network modeling. 

Methods 

Preparation of gene expression data for algorithm 

validation 

Simulated data 

The simulated time course gene expression dataset was 
generated using SynTReN [46] for a artificial network 
with 76 genes, of which 24 act as regulators with a total 
of 124 regulatory relationships (i.e. 124 edges). The total 
number of time points is 50. All parameters of SynTReN 
were set to default values [46], except number of corre- 
lated inputs, which was set to 50%. The topological 
structure and inner interacting relationships are sampled 
from the characteristics of the yeast transcriptional net- 
work, therefore the results will be indicative of the algo- 
rithm performance on real data. 
Yeast cell cycle study 

Yeast cell cycle gene expression data were downloaded 
from http://genome-www.stanford.edu/cellcycle/. These 
studies [48,55] profiled expression changes in 6178 
genes at -20 time points under each condition follow- 
ing alpha factor arrest (18 time points from 0-119 min- 
utes), elutriation ELU (14 time points from 0-390 
minutes), and arrest of a cdcl5 (24 time points from 
10-290 minutes) and a cdc28 (28 time points from 0- 
160 minutes) temperature sensitive mutant. Many genes 
have missing data points. The cdc28 data is the most 
severely affected, -80% of genes contains at least 1 
missing values. For the remaining three datasets, it ran- 
ged 6-27%. In this study, we chose the cdcl5 dataset, as 
it contains the most number of time points out of the 
three [56]. Network modeling was performed on the 
107 known cell cycle genes [57]. The list is given in 
Table 7. These are the genes that most likely to have 
interesting interactions during the time course being 
studied. 

Mouse pancreas development and regeneration after 
damage 

The pancreas development and growth expression data was 
downloaded from the RNA Abundance Database http:// 
www.cbil.upenn.edu/RAD, with study IDs 2 and 1790. 
Study 2 profiled mouse pancreas gene expression at six dif- 
ferent developmental time points: embryonic day 14.5, 
16.5, 18.5, at birth, at postnatal day 7, and at adulthood. 4 
samples at E14.5, and 6 at all the following time points, 
totaling 34 samples. Study 1790 profiled gene expression in 
mice pancreas following partial pancreatectomy and Exen- 
din-4 treatment. Exendin-4 is a glucagon-like peptide- 1 
receptor agonist that augments the pancreatic islet beta- 
cell mass by increasing beta-cell neogenesis and prolifera- 
tion and by reducing apoptosis. Mice underwent 50% 



Table 7 The 107 Yeast cell cycle genes that were 
simulated for their network structure 



ACE2 


CLB6 


HHF2 


MSH6 


RFA3 


(ojUozzJ 


(oj jUUjJ 


(ojj/U I ) 


fPC, 1 AV1 ^ 
(OJ I 0/ I ) 


(OJ jzDDJ 


AGA1 


CLN1 


HHT1 


MST1 


RME1 


(CO J /OUJ 


(OJ JZjy) 


\ODZZyj) 


VOJ J04UJ 


for--)QTr\ 

(cozyjjj 


ASE1 


CLN2 


HHT1 


NDD1 


RNR1 


(OJ4ZZjJ 


VOJ JO 1 y) 


VOJJ/UUJ 


(OJ4J J4J 


(OJDoU I ) 


ASF1 


CLN3 


HHT2 


NUM1 


RNR3 


fPC33 J7 1 ! 
(OJOOZ/J 


(Q^ 1 1 Q1 

voj i i y i ) 


foe J JQ^ 


foe 1 H~7\ 
Voj 1 1 Zl) 


fQ^A 7 AA\ 
(OJ4/44J 


ASF2 


CTS1 


HHT2 


PCL1 


SED1 


for 1 

(OJ I DOUJ 


(ojuyyzj 


(ojj/UUJ 


(ojj4z/J 


(OJ I o4yj 


ASH1 


CWP1 


HO (851371) 


PCL2 


SIC1 


(oj jOJUJ 


(OJ J/OOj 




foe 1 a 3rr\ 
Voj I 4oUJ 


(ojU/DoJ 


CDC14 


CWP2 


HSL1 


PCL9 


SPC42 


focn^oc'i 
(ojUjojJ 


(oj j/OJJ 


(oj j/dUJ 


fpc, 1 o~7c:\ 


(OJ JOZ4J 


CDC20 


DBF2 


HTA1 


PDS1 


SP012 


(OJZ/OZJ 


(ojzyo4j 


fP^ 1 P1 1 ^ 
VOJ I O I I ) 


fOC 1 f.Q'i \ 

[oD \ oy i ) 


(OJDJ J/ ) 


CDC21 


DBF4 


HTA2 


PMS1 


SST2 


fQ^/l 1A 1 
(OJ4Z4 I ) 


foci A"}^ 
VOJ 1 OZd) 


(OJZZojJ 


(OJ J04ZJ 


for i i ~70\ 
(OJ 1 1 / J) 


CDC45 


DPB2 


HTB1 


P0L1 


STE2 


(oju/yoj 


(ojOjUJJ 


fQR 1 Q1 PA 
VOJ I O I UJ 


(oj joz 1 ) 


(ojUj I o) 


CDC5 


DPB3 


HTB2 


P0L12 


SWE1 


(oj jU 1 d) 


for 1^Q.C\\ 
(OJZJOUJ 


fP^HP/H 
(OJZZo4J 


(ojzZ4jJ 


(OJ JZJZj 


CDC6 


EGT2 


KAR3 


POL2 


SWI4 


fproo AA\ 
VOJJZ44J 


(OJ J DOy) 


(OJOZOjJ 


(OJ J4jyj 


(OJD04/ ) 


CDC8 


FAR1 


KAR4 


POL30 


SWI5 


(CO J JZUJ 


VOJ jZojJ 


(OJUjUjJ 


(OJZjoJJ 


for i 77/n 
(OJ 1 / ZH) 


CDC9 


FKH1 


KIN3 


PRI1 


SWI6 


(OJ \ Dy \ ) 


(OJ40/ D) 


VOJ 1 Zl J) 


(0J40ZJJ 


fprrip7Q\ 
(oJUo/yj 


CHS1 


FKH2 


KRE6 


PRI2 


TEC1 


fper r-")Q\ 


(oJ JOJOJ 


fP^A JP7"\ 
(oJOzo/J 


(OJ JOZ I ) 


(OJZO/ / ) 


CHS3 


FKS1 


MBP1 


PSA1 


TIP1 


foe 13 1 "H 
(ojzo I I ) 


for 1 n^,^ 
voj 1 Uj Jj 


VOJ I jUjJ 


fpc i cn/n 

(OJ 1 JU4J 


(ojzjjyj 


CIK1 


FUS1 


MCD1 


RAD 17 


TIR1 


(855238) 


(850330) 


(851561) 


(854550) 


(856729) 


CLB1 


GAS1 


MCM1 


RAD27 


UNG1 


(853002) 


(855355) 


(855060) 


(853747) 


(854987) 


CLB2 


GIC2 


MFA2 


RAD51 


YR02 


(856236) 


(851904) 


(855577) 


(856831) 


(852343) 


CLB3 


HHF1 


MNN1 


RAD54 




(851400) 


(852294) 


(856718) 


(852713) 




CLB4 


HHF1 


MOB1 


RFA1 




(850907) 


(855701) 


(854700) 


(851266) 




CLB5 


HHF2 


MSH2 


RFA2 




(856237) 


(852294) 


(854063) 


(855404) 





In parenthesis are the corresponding gene IDs. 



pancreatectomy or sham operation, and received Exendin- 
4 or vehicle every 24 hours. 3-4 animals from each group 
were sacrificed at each time point of 12, 24 and 48 hr after 
operation, together with 4 animals that received no opera- 
tion, totaling 46 samples. Because the two studies each 
only contain a few time points, we combined their data for 
network modeling [58]. Replicate samples under the same 
condition at the same time point were averaged. 

The network modeling was performed on 36 genes 
manually collected from a recent review by Servitja and 
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Table 8 The 36 mouse genes chosen to reconstruct 
interaction networks during pancreas development and 
growth 



Acvrl (11477) 


Hes1 (15205) 


Nfe2l2 (18024) 


Pdx1 (18609) 


Anxa4 (11746) 


Hnf4a (15378) 


Nfkbl (18033) 


Pklr (18770) 


Bmp2 (12156) 


lapp (15874) 


Nfkbia (18035) 


Ppib (19035) 


Cbfb (12400) 


Ins2 (16334) 


Nkx2-2 (18088) 


Psen2 (19165) 


Chuk (12675) 


IsM (16392) 


Npm1 (18148) 


Rps4x (20102) 


Cryz (12972) 


Mafb (16658) 


Nr5a2 (26424) 


Slc2a2 (20526) 


Foxa2 (15376) 


Myo6 (17920) 


Nrp1 (18186) 


Stat3 (20848) 


Foxa3 (15377) 


Nckapl (50884) 


Onecutl (15379) 


Tcf1 (21405) 


Gck (103988) 


Neurog3 (11925) 


Pax6 (18508) 


Ugcg (22234) 



In parenthesis are the corresponding gene IDs. 



Ferrer [52], which are known to be important in pan- 
creas development. They are listed in Table 8. 
Digitization of gene expression data 

Expression data were further discretized into three 
levels. In each data set, we calculated the mean (u.) and 
standard deviation (SD) of expression across all time 
points for each gene. Each expression value is then 
assigned to 0, 1 or 2 according to whether the value is 
less than (i-SD, between (i-SD and (i+SD, or above 
(i+SD. 

Prior data of interaction and transcription binding 

Annotations of known yeast gene interaction were 
downloaded from the Biomolecular Interaction Network 
Database (BIND, http://bind.ca), a database designed to 
store full descriptions of interactions, molecular com- 
plexes and pathways [49]. BIND includes both directed 
(such as protein-DNA interaction) and un-directed 
(such as protein-protein interaction) interactions. There- 
fore when comparing to BIND annotations, we ignored 
direction. 

Simon et al studied the transcription regulation of yeast 
genes by 9 cell cycle regulating transcription factors (TF): 
Fkhl, Fkh2, Nddl, Mcml, Ace2, Swi5, Mbpl, Swi4, and 
Swi6, using the ChlP-chip technology [45]. These nine 
TFs are among the 107 cell cycle genes that we performed 
network modeling. The data were downloaded from 
http://staffa.wi.mit.edu/cgi-bin/young_public/navframe. 
cgi?s=17&f=downloaddata. For each TF, the study derived 
a binding p-value for each gene which reflects the likeli- 
hood that the TF binds to the promoter of this gene. We 
constructed a positive control target set for each TF that 
consists of those with p < 0.001, a negative control target 
set for each TF that consists of those with p > 0.1. Note 
that the transcription binding data provide directed 
information. 

Additional material 



Additional file 1: Predicted regulatory relationships missed by the 
plain BN. most established regulatory relationships missed by the plain 



BN involve two genes that share significant GO similarity and PubMed 
co-citation. 
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